Abstract

This paper addresses the general problem of modelling and learning rank data with ties. 
We propose a probabilistic generative model that treats the data as permutations over
partitions. This results in a super-exponential combinatorial state space with an unknown number
of partitions and an unknown ordering among them. We approach the problem from the perspective of
discrete choice theory, where subsets are chosen in a stagewise manner, significantly reducing the
state space at each stage. Further, we show that with suitable parameterisation, we can still
learn the models in linear time. We evaluate the proposed models on the problem of learning 
to rank with the data from the recently held Yahoo! challenge, and demonstrate that the models 
are competitive against well-known rivals. 

1 Introduction 

Ranking appears to be natural to humans as we often express preferences over things. Consequently,
rank data has been widely studied in the statistical sciences (e.g. see [20] for a comprehensive
survey). More recently, the intersection between machine learning and information retrieval has
resulted in a fruitful sub-area called learning to rank (e.g. see [17] for a recent review), where the
goal is to learn rank functions that can accurately order objects from retrieval systems. Broadly
speaking, a rank is a type of permutation, where the ordering of objects has some meaningful
interpretation, e.g. the rank of student performance in a class. Although we would like to obtain
a complete ordering over a set of objects, often this is possible only in small sets. In larger sets, it
is more natural to rate an object on a rating scale, and the result is that many objects may have
the same rating. Such a phenomenon is common in large sets such as movies, books or web-pages,
wherein many objects may have tied ratings.

This paper focuses on modelling and learning rank data with ties. Previous work often
involves paired comparisons (e.g. see iTTl lfTTIIIS?! ), ignoring simultaneous interactions among
objects. Such interactions can be strong: in the case of learning to rank, objects are often returned
from a query, and are thus clearly related to the query and to each other. We take an alternative
approach by modelling objects with the same tie as a partition, translating the problem into ranking
or ordering these partitions. This problem transformation results in a combinatorial problem: set
partitioning with an unknown number of subsets and an unknown order amongst them. For a given
number of partitions, the order amongst them is a permutation of the partitions being considered,
wherein each partition has objects of the same rank. A generative view of the problem can then
be as follows: choose the first partition with elements of rank 1, then choose the next partition
from the remaining objects with elements ranked 2, and so on. The number of partitions then does
not have to be specified in advance, and can be treated as a random variable. The joint distribution
for each ordered partition can then be composed using a variant of the Plackett-Luce model
[18, 23], substituting object potentials by the partition potential. We propose two choices for
these potential functions: First, we consider the potential of each partition to be the normalised
sum of individual object potentials in that partition, leading to a simple normalisation factor in
the estimation of the joint distribution. Second, we propose an MCMC-based parameter estimation
for the general choice of potential functions. We refer to this model as the Probabilistic Model
over Ordered Partitions. Demonstrating its application to the learning to rank problem, we use the
dataset from the recently held Yahoo! challenge [28]. Besides the regular first-order features, we
study second-order features constructed as the Cartesian product over the feature set. We show
that our results, both in terms of predictive performance and training time, are competitive with
other well-known methods such as RankNet [3], Ranking SVM [15] and ListMLE jj?). With the
choice of our proposed simple potential function, we get the added advantage of lower computational
cost, as it is linear in the query size compared to the quadratic complexity of the pairwise
methods.

Our main contributions are the construction of a probabilistic model over ordered partitions
and the associated inference and learning techniques. The complexity of this problem is
super-exponential with respect to the number of objects ($N$) because both the number of partitions and
their order are unknown: it grows as $N!/(2(\ln 2)^{N+1})$ [21, pp. 396-397]. Our
contribution is to overcome this computational complexity through the choice of suitable potential
functions, yielding learning algorithms with linear complexity, thus making the algorithm
deployable in real settings. The novelty lies in the rigorous examination of probabilistic models
over ordered partitions, extending earlier work in discrete choice theory [9, 18, 23]. The
significance of the model is its potential for use in many applications. One example is the learning to
rank with ties problem, which is studied in this paper. Further, the model opens up new potential
applications, for example novel types of clustering in which the clusters are automatically ordered.

2 Background 

In this section, we review some background in rank modelling and learning to rank which are 
related to our work. 

Rank models. Probabilistic models of permutation in general and of rank in particular have
been widely analysed in the statistical sciences (e.g. see [20] for a comprehensive survey). Since the
number of all possible permutations over $N$ objects is $N!$, multinomial models are only
computationally feasible for small $N$ (e.g. $N < 10$). One approach to avoid this state space explosion is to
deal directly with the data space, i.e. to work with the distance between two ranks. The assumption
is that there exists a modal ranking over all objects, and what we observe are ranks randomly
distributed around the mode. The best-known model is perhaps the Mallows model [19], where the
probability of a rank decreases exponentially with the distance from the mode. The model differs
depending on the distance measure; popular distance measures include those by
Kendall and Spearman. The problem with this approach is that it is hard to handle the cases of
multiple modes, ties and incomplete rankings.

Another line of reasoning is largely associated with discrete choice theory (e.g. see [18]),
which assumes that each object has an intrinsic worth which is the basis for the ordering between
them. For example, Bradley and Terry [1] assumed that the probability of object preference is
proportional to its worth, resulting in a logistic-style distribution for pairwise comparison.
Subsequently, Luce [18] and Plackett [23] extended this model to multiple objects. More precisely,
for a set of objects denoted by $\{x_1, x_2, \dots, x_N\}$ the probability of the ordering
$x_1 \succ x_2 \succ \dots \succ x_N$ is defined as

$$P(x_1 \succ x_2 \succ \dots \succ x_N) = \prod_{i=1}^{N} \frac{\phi(x_i)}{\sum_{j=i}^{N} \phi(x_j)}$$

where $x_i \succ x_j$ denotes the preference of object $x_i$ over $x_j$, and $\phi(x_i) \in \mathbb{R}$ is the worth of the object
$x_i$. The idea is that we proceed by selecting objects in a stagewise manner: choose the first object
among $N$ objects with probability $\phi(x_1)/\sum_{j=1}^{N} \phi(x_j)$, then choose the second object among
the remaining $N - 1$ objects with probability $\phi(x_2)/\sum_{j=2}^{N} \phi(x_j)$, and so on until all objects are
chosen. It can be verified that the distribution is proper, that is, $P(x_1 \succ x_2 \succ \dots \succ x_N) > 0$ and the
probabilities of all possible orderings sum to one. This paper follows this approach as it
is easily interpretable and flexible enough to incorporate ties and incomplete ranks.
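
The stagewise selection described above can be sketched in a few lines of Python. This is a minimal illustration, not code from the paper: the function name `plackett_luce_prob` and the example worths are assumptions chosen for the example, and the worths are taken to be strictly positive.

```python
from itertools import permutations

def plackett_luce_prob(worths):
    """Probability of the ordering x_1 > x_2 > ... > x_N, where worths[i]
    is the (assumed strictly positive) worth phi(x_{i+1}) of the object
    placed at position i+1."""
    remaining = sum(worths)      # normaliser over objects not yet chosen
    prob = 1.0
    for phi in worths:
        prob *= phi / remaining  # choose this object among the remaining ones
        remaining -= phi         # remove it from the pool
    return prob

# Hypothetical worths for three objects, listed in the order being scored.
worths = [3.0, 2.0, 1.0]
p_order = plackett_luce_prob(worths)  # (3/6) * (2/3) * (1/1)

# Summing over all 3! orderings illustrates that the distribution is proper.
total = sum(plackett_luce_prob(list(p)) for p in permutations(worths))
```

Each factor only needs the running sum of the remaining worths, so a single ordering is scored in time linear in the number of objects.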

Finally, for completeness, we mention in passing the third approach, which treats a permutation
as an element of a symmetric group and applies spectral decomposition techniques ISll lfTSl .

Learning to rank. Learning-to-rank is an active topic at the intersection of machine
learning and information retrieval (e.g. see [17] for a recent survey). The basic idea is that
we can learn ranking functions that capture the relevance of an object (e.g. a document or an
image) with respect to a query. Although it appears to be an application of rank theory, the setting
and goal are inherently different from traditional rank data in the statistical sciences. Often, the pool
of all possible objects in a typical retrieval system is very large, and often changes over time.
Thus, it is not possible to enumerate objects in the rank models. Instead, each object-query pair is
associated with a feature vector, which often describes how relevant the object is with respect to
the query. As a result, the distribution over objects is query-specific, and these distributions share
the same parameter set. As discussed in [17], machine learning methods extended to ranking can
be divided into:

Pointwise approach, which includes methods such as ordinal regression HHSl. Each query-document
pair is assigned an ordinal label, e.g. from the set $\{0, 1, 2, \dots, M\}$. This simplifies the
problem as we do not need to worry about the exponential number of permutations. The complexity
is therefore linear in the number of query-document pairs. The drawback is that the ordering
relation between documents is not explicitly modelled.

Pairwise approach, which spans preference to binary classification [3, 10, 15] methods, where
the goal is to learn a classifier that can separate two documents (per query). This casts the ranking
problem into a standard classification framework, wherein many algorithms are readily available,
for example, SVM [15], neural network and logistic regression [3], and boosting [10]. The
complexity is quadratic in the number of documents per query and linear in the number of queries. Again,
this approach ignores the simultaneous interaction among objects within the same query.

Figure 1: Complete ordering (left) versus subset ordering (right). For the subset ordering, the
bounding boxes represent the subsets of elements of the same rank. Subset sizes are 4, 3, 1, 2,
respectively.

Listwise approach which models the distribution of permutations pi fSSl fZ/l . The ultimate 
goal is to model a full distribution of all permutations, and the prediction phase outputs the most 
probable permutation. This approach appears to be most natural for the ranking problem. In fact, 
the methods suggested in lH IZTll are applications of the Plackett-Luce model. 

3 Modelling Sets with Ordered Partitions 
3.1 Problem Description 

Let $X = \{x_1, x_2, \dots, x_N\}$ be a collection of $N$ objects. In a complete ranking setting, each object $x_i$
is further assigned a ranking index $\pi_i$, resulting in the ranked list $\{x_{\pi_1}, x_{\pi_2}, \dots, x_{\pi_N}\}$ where
$\pi = (\pi_1, \dots, \pi_N)$ is a permutation over $\{1, 2, \dots, N\}$. For example, $X$ might be a set of documents
returned by a search engine in response to a query, where $\pi_1$ is the index of the first document, $\pi_2$
is the index of the second document and so on. Ideally $\pi$ should contain ordering information for all
returned documents; however, this task is not always possible for any non-trivial size due to the
labor cost involved^1. Instead, in many situations, during training a document is rated^2 to indicate
its degree of relevance for the query. This creates a scenario where more than one document
will be assigned the same rating, a situation known as 'ties' in learning-to-rank. When we
enumerate over each object $x_i$ and put those with the same rating together, the set of objects
$X$ can now be viewed as being divided into $K$ partitions, with each partition assigned a
number indicating its unique rank $k \in \{1, 2, \dots, K\}$. The ranks are obtained by sorting the ratings
associated with each partition in decreasing order. Our essential contribution in this section is
a probabilistic model over this set of partitions, together with methods for learning its parameters
from data and performing inference.

Consider a more generic setting in which we know that objects will be rated against an ordinal
value from 1 to $K$ but do not know individual ratings. This means that we have to consider all
possible ways to split the set $X$ into exactly $K$ partitions, and then rank those partitions from 1
to $K$, wherein the $k$th partition contains all objects rated with the same value $k$. This is the first
rough description of the state space for our model. Formally, for a given $K$ and the order among the
partitions $\sigma$, we write the set $X = \{x_1, \dots, x_N\}$ as a union of $K$ partitions

$$X = \bigcup_{k=1}^{K} X_{\sigma_k} \tag{1}$$

^1 We are aware that clickthrough data can help to obtain a complete ordering, but the data may be noisy.
^2 We caution against confusing 'rating' and 'ranking' here. Ranking is the process of sorting a set of objects in
increasing or decreasing order, whereas in 'rating' each object is given a value indicating its preference.
where $\sigma = (\sigma_1, \dots, \sigma_K)$ is a permutation over $\{1, 2, \dots, K\}$ and each partition $X_k$ is a non-empty
subset of objects with the same rating $k$. These partitions are pairwise disjoint and have
cardinalities ranging from 1 to $N$. It is easy to see that when $K = N$, each $X_k$ is a singleton, $\sigma$ is now a
complete permutation over $\{1, \dots, N\}$ and the problem reduces exactly to the complete ranking
setting mentioned earlier. To get an idea of the state space, it is not hard to see that there are
$\left\{{N \atop K}\right\} K!$ ways to partition and order $X$, where $\left\{{N \atop K}\right\}$ is the number of possible ways to divide a
set of $N$ objects into $K$ partitions, otherwise known as the Stirling numbers of the second kind
105]. If we consider all the possible values of $K$, the size of our state space is

$$\sum_{K=1}^{N} \left\{{N \atop K}\right\} K! = \mathrm{Fubini}(N) \tag{2}$$

which is also known in combinatorics as the Fubini number [21, pp. 396-397]. This number grows
super-exponentially. For instance, Fubini(1) = 1, Fubini(3) = 13, Fubini(5) = 541 and
Fubini(10) = 102,247,563. Its asymptotic behaviour can also be shown [21, pp. 396-397] to
approach $N!/(2(\ln 2)^{N+1})$ as $N \to \infty$, where we note that $\ln(2) < 1$, and thus it grows much faster
than $N!$. Clearly, for unknown $K$ this presents a very challenging problem. In this paper, we shall
present an efficient and generic approach to tackle this state-space explosion.
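
The growth of this state space is easy to check numerically. The sketch below is an illustration rather than part of the original text: it computes the Stirling numbers of the second kind with the standard recurrence S(n, k) = k S(n-1, k) + S(n-1, k-1), sums S(N, K) K! over K, and compares the result with the asymptotic form. The helper names `stirling2_row`, `fubini` and `fubini_asymptotic` are hypothetical.

```python
from math import factorial, log

def stirling2_row(n):
    """Return [S(n,0), ..., S(n,n)], the Stirling numbers of the second kind."""
    row = [1]  # S(0,0) = 1
    for i in range(1, n + 1):
        new = [0] * (i + 1)
        for k in range(1, i + 1):
            prev_k = row[k] if k < len(row) else 0  # S(i-1, k), zero when k = i
            new[k] = k * prev_k + row[k - 1]        # k*S(i-1,k) + S(i-1,k-1)
        row = new
    return row

def fubini(n):
    """Number of ordered partitions of n objects: sum over K of S(n,K) * K!."""
    return sum(s * factorial(k) for k, s in enumerate(stirling2_row(n)))

def fubini_asymptotic(n):
    """Asymptotic form n! / (2 (ln 2)^(n+1))."""
    return factorial(n) / (2.0 * log(2.0) ** (n + 1))

values = {n: fubini(n) for n in (1, 3, 5, 10)}
ratio_at_10 = fubini_asymptotic(10) / fubini(10)
```

Already at N = 10 the asymptotic expression agrees closely with the exact count, which makes explicit how quickly exhaustive enumeration becomes infeasible.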



3.2 Probabilistic Model over Ordered Partitions

Returning to our problem, our task now is to model a distribution over the ordered partitioning of the set $X$
into $K$ partitions and the ordering $\sigma = (\sigma_1, \dots, \sigma_K)$ among the $K$ partitions given in Eq (1):

$$p(X) = p(X_{\sigma_1}, \dots, X_{\sigma_K}) \tag{3}$$

A two-stage view has been given thus far: first $X$ is partitioned in an arbitrary way so long as
it creates $K$ partitions, and then these partitions are ranked, resulting in a ranking index vector $\sigma$.
This description is generic and one can proceed in different ways to further characterise Eq (3).
We present here a generative, multistage view of this same problem so that it lends itself naturally
to the specification of the distribution in Eq (4): First, we construct a subset $X_1$ from $X$ by
collecting all objects which (supposedly) have the largest ratings. If there are more elements
in the remainder set $\{X \setminus X_1\}$ to be selected, we construct a subset $X_2$ from $\{X \setminus X_1\}$ whose
elements have the second largest ratings. This process continues until there are no more objects
to be selected^3. An advantage of this view is that the resulting total number of partitions $K_\sigma$
is automatically generated, need not be specified in advance, and can be treated as a random
variable. If our data truly contains $K$ partitions then $K_\sigma$ should be equal to $K$. Using the chain
rule, we write the joint distribution over ranked partitions as

$$p(X_1, \dots, X_{K_\sigma}) = p(X_1) \prod_{k=2}^{K_\sigma} p(X_k \mid X_1, \dots, X_{k-1}) = P_1(X_1) \prod_{k=2}^{K_\sigma} P_k(X_k \mid X_{1:k-1}) \tag{4}$$

where we have used $X_{1:k-1} = \{X_1, \dots, X_{k-1}\}$ for brevity.

^3 This process resembles the generative process of the Plackett-Luce discrete choice model [18, 23], except that we apply it to
partitions rather than single elements. It is clear from here that the Plackett-Luce model is a special case of ours wherein each
partition reduces to a singleton.

3.3 Parameterisation, Learning and Inference



It remains to specify the local distribution $P_k(X_k \mid X_{1:k-1})$. Let us first consider what choices we
have after the first $(k - 1)$ partitions have been selected. It is clear that we can select any objects
from the remainder set $\{X \setminus X_{1:k-1}\}$ for our next, $k$th, partition. If we denote this remainder set by
$R_k = \{X \setminus X_{1:k-1}\}$ and $N_k = |R_k|$ is the number of remaining objects, then our next partition $X_k$ is a
subset of $R_k$; furthermore, there are precisely $(2^{N_k} - 1)$ such non-empty subsets. Using the notation
$2^{R_k}$ to denote the power set of the set $R_k$, i.e. $2^{R_k}$ contains all possible non-empty subsets^4 of $R_k$,
we are ready to specify each local conditional distribution in Eq (4) as:

$$P_k(X_k \mid X_{1:k-1}) = \frac{\Phi_k(X_k)}{\sum_{S \in 2^{R_k}} \Phi_k(S)} \tag{5}$$

where $\Phi_k(S) > 0$ is an order-invariant^5 set function defined over a set or partition $S$, and the
summation in the denominator clearly makes the definition in Eq (5) a proper distribution. The set
function $\Phi_k(\cdot)$ can also be interpreted as the potential function in the standard probabilistic graphical
models literature.

Although the state space $2^{R_k}$ for this local conditional distribution is significantly smaller than
the space of all possible ordered partitions of $N$ objects, it is still exponential: as shown
earlier, its size is $2^{N_k} - 1$. In general, directly computing the normalising term is still not possible,
let alone learning the model parameters. In what follows, we will study an efficient special case
which has (sub)-quadratic complexity in learning, and a general case with MCMC approximation.
We further term our Probabilistic Model over Ordered Partitions PMOP.



3.3.1 Full-Decomposition PMOP 

Under a full-decomposition setting, we assume the following local additive decomposition at each
$k$th step:

$$\Phi_k(X_k) = \frac{1}{|X_k|} \sum_{x \in X_k} \phi_k(x) \tag{6}$$

The normalising term $|X_k|$ ensures that the probability is not monotonically increasing with the
number of objects in the partition. Given this form, the local normalisation factor represented
in the denominator of Eq (5) can now be efficiently represented as the sum of all weighted sums
of objects. Since each object $x$ in the remainder set $R_k$ participates in the same additive manner
towards the construction of the denominator in Eq (5), it must admit the following form^6:

$$\sum_{S \in 2^{R_k}} \Phi_k(S) = \sum_{S \in 2^{R_k}} \frac{1}{|S|} \sum_{x \in S} \phi_k(x) = C \times \sum_{x \in R_k} \phi_k(x) \tag{7}$$

where $C$ is some constant whose exact value is not essential under a maximum likelihood parameter
learning treatment (readers are referred to Appendix A for the computation of $C$). To see this,
substitute Eq (6) and (7) into Eq (5):

$$\log P_k(X_k \mid X_{1:k-1}) = \log \frac{\Phi_k(X_k)}{\sum_{S \in 2^{R_k}} \Phi_k(S)} = \log \frac{\frac{1}{|X_k|} \sum_{x \in X_k} \phi_k(x)}{C \sum_{x \in R_k} \phi_k(x)} = \log \frac{\sum_{x \in X_k} \phi_k(x)}{\sum_{x \in R_k} \phi_k(x)} - \log C|X_k| \tag{8}$$

^4 The usual understanding would also contain the empty set, but we exclude it in this paper.
^5 i.e., the function value does not depend on the order of elements within the partition.
^6 To illustrate this intuition, suppose the remainder set is $R_k = \{a, b\}$, hence its power set, excluding $\emptyset$, contains 3
subsets $\{a\}$, $\{b\}$, $\{a, b\}$. Under the full-decomposition assumption, the denominator in Eq (5) becomes
$\phi(a) + \phi(b) + \frac{1}{2}\{\phi(a) + \phi(b)\} = (1 + \frac{1}{2}) \sum_{x \in \{a,b\}} \phi(x)$. The constant term is $C = \frac{3}{2}$ in this case.

Since $\log C|X_k|$ is a constant w.r.t. the parameters used to parameterise the potential functions
$\phi_k(\cdot)$, it does not affect the gradient of the log-likelihood. It is also clear that maximising the
likelihood given in Eq (4) is equivalent to maximising each local log-likelihood function given
in Eq (8) for each $k$. Discarding the constant term in Eq (8), we re-write it in this simpler form:

$$\log P_k(X_k \mid X_{1:k-1}) = \log \sum_{x \in X_k} g_k(x \mid X_{1:k-1}) \quad \text{where} \quad g_k(x \mid X_{1:k-1}) = \frac{\phi_k(x)}{\sum_{x' \in R_k} \phi_k(x')} \tag{9}$$

Depending on the specific form chosen for $\phi_k(x)$, maximising the log-likelihood in the form of Eq (9) can
be carried out in most cases. Gradient-based learning for this type of model generally has quadratic time
complexity in $N$. However, using a dynamic programming technique, we show that if the function $\phi_k(x)$
does not depend on its position $k$, then the gradient-based learning complexity can be reduced to
linear in $N$.
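
The constant-factor property of the full-decomposition normaliser can be checked by brute force on a small remainder set. The sketch below is an illustration under stated assumptions: the helper names are hypothetical, the potentials are arbitrary positive numbers, and the closed form C = (2^n - 1)/n is derived here from the identity C(n-1, m-1)/m = C(n, m)/n rather than taken from the paper's appendix.

```python
from itertools import combinations

def brute_force_normaliser(phis):
    """Sum of Phi(S) over all non-empty subsets S, with Phi(S) equal to the
    mean of the object potentials in S (full-decomposition assumption)."""
    n = len(phis)
    total = 0.0
    for m in range(1, n + 1):                # subset size
        for subset in combinations(phis, m):
            total += sum(subset) / m         # (1/|S|) * sum of phi over S
    return total

def closed_form_constant(n):
    """C = sum_{m=1..n} C(n-1, m-1) / m = (2**n - 1) / n."""
    return (2 ** n - 1) / n

phis = [0.5, 1.2, 2.0, 0.7]                  # hypothetical object potentials
lhs = brute_force_normaliser(phis)           # enumerates 2**4 - 1 subsets
rhs = closed_form_constant(len(phis)) * sum(phis)
```

For two objects the closed form gives C = 3/2, and in general the exponential-size sum collapses to a single pass over the remaining objects.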

To see how the linear-time computation works, we drop the explicit dependency on the subscript $k$ in the definition of $\phi_k(\cdot)$
and maintain an auxiliary array $a_k = \sum_{x \in R_k} \phi(x)$, where $a_{K_\sigma} = \sum_{x \in X_{K_\sigma}} \phi(x)$ and
$a_k = a_{k+1} + \sum_{x \in X_k} \phi(x)$ for $k < K_\sigma$. Clearly $a_{1:K_\sigma}$ can be computed in $N$ time in a backward fashion. Thus,
$g_k(\cdot)$ in Eq (9) can also be computed linearly via the relation $g_k(x) = \phi(x)/a_k$. This also implies
that the total log-likelihood can also be computed linearly in $N$.

Furthermore, the gradient of the log-likelihood function can also be computed linearly in $N$.
Given the likelihood function in Eq (4), using Eq (9), the log-likelihood function and its
gradient, without explicit mention of the parameters, can be shown to be^7

$$\mathcal{L} = \log p(X_1, \dots, X_{K_\sigma}) = \sum_{k=1}^{K_\sigma} \log \sum_{x \in X_k} g_k(x \mid X_{1:k-1}) = \sum_{k=1}^{K_\sigma} \log \sum_{x \in X_k} \frac{\phi(x)}{a_k} \tag{10}$$

$$\partial \mathcal{L} = \sum_{k} \partial \log \sum_{x \in X_k} \phi(x) - \sum_{k} \partial \log a_k = \sum_{k} \frac{\sum_{x \in X_k} \partial \phi(x)}{\sum_{x \in X_k} \phi(x)} - \sum_{k} \frac{1}{a_k} \sum_{x \in R_k} \partial \phi(x) \tag{11}$$

It is clear that the first summation over $k$ in the RHS of the last equation takes exactly $N$ time
since $\sum_{k=1}^{K_\sigma} |X_k| = N$. The second summation over $k$ is more involved because both $k$ and
$|R_k|$ can possibly range from 1 to $N$, so direct computation will cost at most $N(N - 1)/2$ time.
Similar to the case of $a_k$, we now maintain a 2-D auxiliary array^8 $b_k = \sum_{x \in R_k} \partial \phi(x)$, where
$b_{K_\sigma} = \sum_{x \in X_{K_\sigma}} \partial \phi(x)$ and $b_k = b_{k+1} + \sum_{x \in X_k} \partial \phi(x)$ for $k < K_\sigma$. Thus, $b_{1:K_\sigma}$, and therefore the
gradient $\partial \mathcal{L}$, can be computed in $NF$ time in a backward fashion, where $F$ is the number of
parameters.
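
The backward accumulation of the two auxiliary arrays can be sketched as follows. This is a minimal illustration, not the authors' implementation: it assumes the position-independent potential phi(x) = exp(w . x) with a linear score, drops the constant log C|X_k| terms, and uses the hypothetical function name `pmop_loglik_and_grad`.

```python
import math

def pmop_loglik_and_grad(partitions, w):
    """partitions: list of partitions ordered from rank 1 to rank K, each a
    list of feature vectors. Assumes phi(x) = exp(w . x). Returns the
    log-likelihood (up to constants) and its gradient in O(N * F) time."""
    F = len(w)
    loglik = 0.0
    grad = [0.0] * F
    a = 0.0            # suffix sum a_k: sum of phi(x) over the remainder set R_k
    b = [0.0] * F      # suffix sum b_k: sum of d phi(x)/dw over R_k
    for part in reversed(partitions):        # backward pass, k = K, ..., 1
        s = 0.0                              # sum of phi(x) over X_k
        ds = [0.0] * F                       # sum of d phi(x)/dw over X_k
        for x in part:
            phi = math.exp(sum(wi * xi for wi, xi in zip(w, x)))
            s += phi
            for f in range(F):
                ds[f] += phi * x[f]          # d exp(w.x)/dw_f = phi * x_f
        a += s                               # a_k = a_{k+1} + sum over X_k
        for f in range(F):
            b[f] += ds[f]                    # b_k = b_{k+1} + sum over X_k
        loglik += math.log(s) - math.log(a)
        for f in range(F):
            grad[f] += ds[f] / s - b[f] / a
    return loglik, grad

# Hypothetical toy data: three ranked partitions with 2-D features.
parts = [[[1.0, 0.0], [0.5, 0.5]], [[0.0, 1.0]], [[-1.0, 0.2], [0.3, -0.4]]]
w = [0.3, -0.2]
ll, g = pmop_loglik_and_grad(parts, w)
```

Every object is visited once and contributes F operations, so the cost matches the NF bound; only the running sums for the current stage are kept instead of the full arrays.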



3.3.2 General State PMOP and MCMC Inference 

In the general case without any assumption on the form of the potential function $\Phi_k(\cdot)$, using only
Eq (4) and Eq (5), the log-likelihood function and its gradient, again without explicit mention of the
model parameter, are:

$$\mathcal{L} = \log P_1(X_1) + \sum_{k=2}^{K_\sigma} \log P_k(X_k \mid X_{1:k-1}) \tag{12}$$

$$\partial \mathcal{L} = \sum_{k=1}^{K_\sigma} \partial \log \Phi_k(X_k) - \sum_{k=1}^{K_\sigma} \left\{ \sum_{S \in 2^{R_k}} P_k(S \mid X_{1:k-1})\, \partial \log \Phi_k(S) \right\} \tag{13}$$

^7 To be more precise, for $k = 1$ we define $X_{1:0}$ to be $\emptyset$.
^8 This is 2-D because we need to index the parameters as well as the subsets.

Clearly, both the distribution $P_k(X_k \mid X_{1:k-1})$ and the expectation $\sum_{S \in 2^{R_k}} P_k(S \mid X_{1:k-1})\, \partial \log \Phi_k(S)$
are generally intractable to evaluate. In this paper, we make use of MCMC methods to
approximate $P_k(X_k \mid X_{1:k-1})$. There are two natural choices: Gibbs sampling and Metropolis-Hastings
sampling. For Gibbs sampling we note that this problem can be viewed as sampling from a
random field with binary variables. Each object is attached to a binary variable whose states are
either 'selected' or 'not selected' at the $k$th stage. Thus, there will be $2^{N_k} - 1$ joint states in the
random field, where we recall that $N_k$ is the total number of remaining objects after the $(k-1)$-th stage.
The pseudo code for the Gibbs and Metropolis-Hastings routines performed at the $k$th stage is illustrated
in Alg. (1).



Algorithm 1 MCMC sampling approaches for PMOP in general case.

Gibbs sampling

1. Randomly choose an initial subset $X_k$.
2. Repeat until stopping criteria met
   • For each remaining object $x$ at stage $k$, randomly select the object with the probability
     $\Phi_k(X_k^{+x}) / \left( \Phi_k(X_k^{+x}) + \Phi_k(X_k^{-x}) \right)$,
     where $\Phi_k(X_k^{+x})$ is the potential of the currently selected subset $X_k$ if $x$ is
     included and $\Phi_k(X_k^{-x})$ is the potential when $x$ is not.

Metropolis-Hastings sampling

1. Randomly choose an initial subset $X_k$.
2. Repeat until stopping criteria met
   • Randomly choose the number of objects $m$, subject to $1 \le m \le N_k$.
   • Randomly choose $m$ distinct objects from the remaining set $R_k = \{X \setminus X_{1:k-1}\}$ to
     construct a new partition denoted by $S$.
   • Set $X_k \leftarrow S$ with the probability of $\min\{1, \Phi_k(S)/\Phi_k(X_k)\}$.

Finally, we note that in the practical implementation of learning, we follow the proposal in [12]
wherein for each local distribution at the $k$th round we run the MCMC for only a few steps starting
from the observed subset $X_k$. This technique is known to produce a biased estimate, but empirical
evidence has so far indicated that the bias is small and the estimate is effective. Importantly, it
is very fast compared to full sampling.
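
A Gibbs sweep over the inclusion indicators can be sketched as below. This is an illustrative sketch under assumptions, not the paper's code: the potential is taken to be the exponential of the mean score of the subset (one possible general-case choice; any positive set function could be substituted), the empty subset is given zero mass by forcing the last selected object to stay in, and the names `potential` and `gibbs_sweeps` are hypothetical.

```python
import math
import random

def potential(scores, subset):
    """Assumed set potential: exp of the mean score over a non-empty subset."""
    return math.exp(sum(scores[i] for i in subset) / len(subset))

def gibbs_sweeps(scores, init_subset, n_sweeps, rng):
    """Run a few Gibbs sweeps over the remaining objects 0..len(scores)-1,
    starting from init_subset (e.g. the observed partition)."""
    current = set(init_subset)
    for _ in range(n_sweeps):
        for i in range(len(scores)):
            with_i = current | {i}
            without_i = current - {i}
            if not without_i:
                # Removing i would leave the empty set, which is excluded.
                current = with_i
                continue
            p_with = potential(scores, with_i)
            p_without = potential(scores, without_i)
            # Conditional probability that object i is selected.
            if rng.random() < p_with / (p_with + p_without):
                current = with_i
            else:
                current = without_i
    return current

rng = random.Random(0)                 # fixed seed for reproducibility
scores = [2.0, 0.5, -1.0, 0.0]         # hypothetical object scores f(x, w)
sample = gibbs_sweeps(scores, {1}, 5, rng)
```

Starting the chain from the observed subset and running only a handful of sweeps corresponds to the short-run strategy described above; longer runs trade speed for lower bias.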



3.4 Learning-to-Rank with PMOP 

To conclude the presentation of our proposed probabilistic model over ordered
partitions (PMOP), we present a specific application of PMOP to the problem of
learning-to-rank. The ultimate goal after training is that, for each query, the system needs to return a list of
related objects and their ranking^9. Slightly different from the standard rank setting in statistics,
the objects in the learning-to-rank problem are often not indexed (e.g. the identity of the object is not
captured in any parameter). Instead, we will assume that for each query-object pair $(q, x)$ we can
extract a feature vector $x^q$. The model distribution specified in this way is thus query-specific. As a
result, we are not interested in finding the single mode for the rank distribution over all queries^10
but in finding the rank mode for each query.

At the ranking phase, suppose that for an unseen query $q$ a list $X^q = \{x_1^q, \dots, x_{N_q}^q\}$ of objects related
to $q$ is returned. The task is then to rank these objects in decreasing order of relevance w.r.t.
$q$. Enumerating over all possible rankings takes on the order of $N_q!$ time. Instead we would like to
establish a scoring function $f(x^q, w) \in \mathbb{R}$ for the query $q$ and each object $x$ returned, where $w$ is
now introduced as the parameter. Sorting can then be carried out much more efficiently, with
complexity of order $N_q \log N_q$ instead of $N_q!$. The function can be specified as a simple linear
combination of features $f(x^q, w) = w^\top x^q$, or a more complicated form, such as a multilayer neural
network, can be used.
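
The score-then-sort step at ranking time can be sketched as follows. This is a minimal illustration assuming the linear scoring function; the name `rank_objects` and the toy feature vectors and weights are hypothetical.

```python
def rank_objects(features, w):
    """Score each object with the linear function f(x, w) = w . x and return
    the object indices sorted by decreasing score, together with the scores."""
    scores = [sum(wi * xi for wi, xi in zip(w, x)) for x in features]
    order = sorted(range(len(features)), key=lambda i: scores[i], reverse=True)
    return order, scores

# Hypothetical query with three returned objects and 2-D feature vectors.
features = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
w = [0.5, 2.0]
order, scores = rank_objects(features, w)   # scores are 0.5, 2.0 and 2.5
```

Scoring is linear in the number of returned objects and the sort dominates the cost, so no enumeration of permutations is needed at prediction time.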

In the practice of learning-to-rank, the dimensionality of the feature vector $x^q$ often remains
the same across all queries, and since it is observed, we use the PMOP described before to specify a
conditional model specific to $q$ over the set of returned objects as follows.

$$p(X^q \mid w) = p(X_1^q, X_2^q, \dots, X_{K_\sigma}^q \mid w) = P_1(X_1^q \mid w) \prod_{k=2}^{K_\sigma} P_k(X_k^q \mid X_{1:k-1}^q, w) \tag{14}$$

We can see that Eq (14) has exactly the same form as Eq (4) specified for PMOP, but applied
instead to the query-specific set of objects $X^q$ and with the additional parameter $w$. During training,
each query-object pair is labelled with a relevance score, which is typically an integer from the set
$\{0, \dots, M\}$ where 0 means the object is irrelevant w.r.t. the query $q$, and $M$ means the object is highly
relevant^11. The value of $M$ is typically much smaller than $N_q$; thus, the issue of ties, described
at the beginning of this section, occurs frequently. In a nutshell, for each training query $q$ and its
associated list of rated objects a PMOP is created. The important parameterisation to note here
is that the parameter $w$ is shared across all queries, thus enabling ranking for unseen queries
in the future.

Using the scoring function $f(x, w)$ we specify the individual potential function $\phi(\cdot)$ in the
exponential form:

$$\phi_k(x, w) = \exp\{f(x, w)\}$$

The local potential function defined over a partition, $\Phi_k(X_k)$, can now be explicitly constructed
under full-decomposition (Subsection 3.3.1) and the general case (Subsection 3.3.2) respectively as
follows.

$$\text{Full-decomposition:} \quad \Phi_k(X_k^q) = \frac{1}{|X_k^q|} \sum_{x \in X_k^q} \exp\{f(x, w)\} \tag{15}$$

$$\text{General case:} \quad \Phi_k(X_k^q) = \exp\left\{ \frac{1}{|X_k^q|} \sum_{x \in X_k^q} f(x, w) \right\} \tag{16}$$

^9 A confusion may arise here: although during training each training query $q$ is supplied with a
list of related objects and their ratings, during the ranking phase the system still needs to return a ranking over the list of
related objects for an unseen query.

^10 This would lead to something like the static rank over all possible objects in the database, like those in Google's
PageRank [2].

^11 Note that generally $K \le M + 1$ because there may be gaps in the rating scales for a specific query.

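The gradient of the log-likelihood function can also be computed efficiently. For full-decomposition,
it can be shown to be:

$$\frac{\partial \log P_k(X_k^q \mid X_{1:k-1}^q)}{\partial w} = \frac{\sum_{x \in X_k^q} \phi_k(x, w)\, \partial_w f(x, w)}{\sum_{x \in X_k^q} \phi_k(x, w)} - \frac{\sum_{x \in R_k^q} \phi_k(x, w)\, \partial_w f(x, w)}{\sum_{x \in R_k^q} \phi_k(x, w)}$$

For the general case, with the linear scoring function $f(x, w) = w^\top x$, the gradient of the log-likelihood function can be shown to be:

$$\frac{\partial \log P_k(X_k^q \mid X_{1:k-1}^q)}{\partial w} = \bar{x}_k - \sum_{S_k \in 2^{R_k}} P_k(S_k \mid X_{1:k-1}^q)\, \bar{s}_k$$

where

$$\bar{x}_k = \frac{1}{|X_k^q|} \sum_{x \in X_k^q} x$$

The quantity $P_k(S_k \mid X_{1:k-1}^q)$ can be interpreted as the probability that the subset $S_k$ is chosen out
of all possible subsets at stage $k$, and $\bar{s}_k$ is the centre of the chosen subset, defined analogously to $\bar{x}_k$.

The expectation $\sum_{S_k} P_k(S_k \mid X_{1:k-1}^q)\, \bar{s}_k$ is expensive to evaluate, since there are $2^{N_k} - 1$ possible
subsets. Thus, we resort to MCMC techniques. We follow the suggestion in [12] to start the
Markov chain from the observed subset $X_k$ and run for a few iterations. The parameter update is
stochastic

$$w \leftarrow w + \eta \left( \bar{x}_k - \frac{1}{n} \sum_{l=1}^{n} \bar{s}_k^{(l)} \right)$$

where $\bar{s}_k^{(l)}$ is the centre of the subset sampled at iteration $l$, $\eta > 0$ is the learning rate, and $n$ is the
number of samples. Typically we choose $n$ to be small, e.g. $n = 1, 2, 3$.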



4 Discussion 

In our specific choice of the local distribution in Eq (5), we share the same idea as that of
Plackett-Luce, in which the probability of choosing the subset is proportional to the subset's
worth, which is realised by the subset potential. In fact, when we limit the subset size to 1, i.e.
there are no ties, the proposed model reduces to the well-known Plackett-Luce model.

It is worth mentioning that the factorisation in Eq (4) and the choice of local distribution
in Eq (5) are not unique. In fact, the chain rule can be applied to any sequence of choices. For
example, we can factorise in a backward manner

$$p(X_1, \dots, X_{K_\sigma}) = P_{K_\sigma}(X_{K_\sigma}) \prod_{k=1}^{K_\sigma - 1} P_k(X_k \mid X_{k+1:K_\sigma}) \tag{17}$$

where $X_{k+1:K_\sigma}$ is a shorthand for $\{X_{k+1}, X_{k+2}, \dots, X_{K_\sigma}\}$. Interestingly, we can interpret this reverse
process as subset elimination: first we choose to eliminate the worst subset, then the second
worst, and so on. This line of reasoning has been discussed in [9] but it is limited to 1-element
subsets. However, if we are free to choose the parameterisation of $P_k(X_k \mid X_{k+1:K_\sigma})$ as we have
done for $P_k(X_k \mid X_{1:k-1})$ in Eq (5), there is no guarantee that the forward and backward
factorisations admit the same distribution.

Our model can be placed into the framework of probabilistic graphical models (e.g. see
[16, 22]). Recall that in standard probabilistic graphical models, we have a set of variables,
each of which receives values from a fixed set of states. Generally, variables and states are
orthogonal concepts, and the state space of a variable does not explicitly depend on the states of
other variables^12. In our setting, the objects play the role of the variables, and their memberships
in the subsets are their states. However, since there are exponentially many subsets, enumerating
the state spaces as in standard graphical models is not possible. Instead, we can consider the ranks
of the subsets in the list as the states, since the ranks only range from 1 to $N$. Different from the
standard graphical models, the variables and the states are not always independent, e.g. when the
subset sizes are limited to 1, the state assignments of variables are mutually exclusive, since
for each position there is only one object. Probabilistic graphical models are generally directed
(such as Bayesian networks) or undirected (such as Markov random fields), and our PMOP can
be thought of as a directed model. The undirected setting is also of great interest, but it is beyond
the scope of this paper.

With respect to tie handling, most previous work focuses on pairwise models. The basic idea
is to assign some probability mass to the event of a tie (e.g. see [24]). For instance, denote by
$x_i \succ x_j$ the preference of $x_i$ over $x_j$, and by $x_i \approx x_j$ the tie between the two
objects. Rao and Kupper [24] proposed the following model

$$P(x_i \succ x_j) = \frac{\phi(x_i)}{\phi(x_i) + \theta\phi(x_j)}$$

$$P(x_i \approx x_j) = \frac{(\theta^2 - 1)\,\phi(x_i)\phi(x_j)}{[\phi(x_i) + \theta\phi(x_j)]\,[\theta\phi(x_i) + \phi(x_j)]}$$

where $\theta \ge 1$ is the parameter that controls the contribution of ties. When $\theta = 1$, the
model reduces to the standard Bradley-Terry model [1]. This method of tie handling is further studied
in [29] in the context of learning to rank. Another method was introduced by Davidson, where the
probability masses are defined as



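$$P(x_i \succ x_j) = \frac{\phi(x_i)}{\phi(x_i) + \phi(x_j) + \nu\sqrt{\phi(x_i)\phi(x_j)}}$$

$$P(x_i \approx x_j) = \frac{\nu\sqrt{\phi(x_i)\phi(x_j)}}{\phi(x_i) + \phi(x_j) + \nu\sqrt{\phi(x_i)\phi(x_j)}}$$

where $\nu > 0$. The applications of these two tie-handling models to learning to rank are detailed
in Appendix C.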
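As a quick sanity check on the two pairwise tie models, the following minimal sketch evaluates the three outcome probabilities (win, loss, tie) for a pair of objects and confirms that they sum to one. The function names and the example worth values are hypothetical; only the formulas follow the definitions above.

```python
import math


def rao_kupper(phi_i, phi_j, theta):
    """Return (P(i beats j), P(j beats i), P(tie)) under the Rao-Kupper model."""
    win = phi_i / (phi_i + theta * phi_j)
    lose = phi_j / (theta * phi_i + phi_j)
    tie = ((theta ** 2 - 1) * phi_i * phi_j
           / ((phi_i + theta * phi_j) * (theta * phi_i + phi_j)))
    return win, lose, tie


def davidson(phi_i, phi_j, nu):
    """Return (P(i beats j), P(j beats i), P(tie)) under the Davidson model."""
    tie_mass = nu * math.sqrt(phi_i * phi_j)
    z = phi_i + phi_j + tie_mass
    return phi_i / z, phi_j / z, tie_mass / z


# Hypothetical worth values for two objects.
print(rao_kupper(2.0, 1.0, theta=1.5))
print(davidson(2.0, 1.0, nu=0.5))
```

Setting `theta=1.0` in the first function, or `nu=0.0` in the second, removes the tie mass and leaves the Bradley-Terry probabilities.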

For ties of multiple objects, we can create a group of objects, and work directly on groups. 
For example, let Xi and Xj be two sport teams, the pairwise team ordering can be defined using 
the Bradley-Terry model as 

$$P(X_i \succ X_j) = \frac{\sum_{x \in X_i} \phi(x)}{\sum_{x \in X_i} \phi(x) + \sum_{x \in X_j} \phi(x)}$$



Note that, this is different from saying the states of variables are independent. 
The extension of the Plackett-Luce model to multiple groups has been discussed in [14]. However,
we should emphasise that this setting is not the same as ours, because the partitioning is known in
advance, and the groups behave just like standard super-objects. Our setting, on the other hand,
assumes no fixed partitioning, and the membership of the objects in a group is arbitrary.



5 Evaluation 
5.1 Setting 

The data is from the Yahoo! learning to rank challenge [5]. This is currently the largest dataset
available for research. At the time of this writing, the data contains the ground-truth labels of
473,134 documents returned from 19,944 queries. The label is the relevance judgment from
0 (irrelevant) to 4 (perfectly relevant). Features for each document-query pair are also supplied by
Yahoo!, and there are 519 unique features.

We split the data into two sets: the training set contains roughly 90% of the queries, and the test
set is the remaining 10%. Two performance metrics are reported: the Normalised Discounted
Cumulative Gain at position $T$ (NDCG@T), and the Expected Reciprocal Rank (ERR). The NDCG@T
metric is defined as

$$\mathrm{NDCG@}T = \frac{1}{\kappa(T)} \sum_{i=1}^{T} \frac{2^{r_i} - 1}{\log_2(1 + i)}$$

where $r_i$ is the relevance judgment of the document at position $i$, and $\kappa(T)$ is a
normalisation constant that makes sure the gain is 1 if the rank is correct. The ERR is defined as

$$\mathrm{ERR} = \sum_{i} \frac{1}{i}\, V(r_i) \prod_{j=1}^{i-1} \bigl(1 - V(r_j)\bigr) \quad \text{where} \quad V(r) = \frac{2^{r} - 1}{16}$$

which puts even more emphasis on the top-ranked documents.
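The two metrics can be computed per query as in the sketch below. It assumes `labels` holds the relevance judgments (0 to 4) of the documents in predicted rank order, takes the normalisation constant to be the DCG of the ideal ordering, and returns 0 for a query with no relevant document; the last convention and the function names are assumptions, not details taken from the challenge.

```python
import math


def dcg_at(labels, T):
    """Discounted cumulative gain of the first T documents."""
    return sum((2 ** r - 1) / math.log2(1 + i)
               for i, r in enumerate(labels[:T], start=1))


def ndcg_at(labels, T):
    """NDCG@T, normalised by the DCG of the ideal (sorted) ordering."""
    ideal = dcg_at(sorted(labels, reverse=True), T)
    return dcg_at(labels, T) / ideal if ideal > 0 else 0.0


def err(labels, max_grade=4):
    """Expected Reciprocal Rank with V(r) = (2^r - 1) / 2^max_grade."""
    score, p_continue = 0.0, 1.0
    for i, r in enumerate(labels, start=1):
        v = (2 ** r - 1) / 2 ** max_grade
        score += p_continue * v / i      # the user stops at position i
        p_continue *= 1.0 - v            # the user moves on to position i + 1
    return score


ranked = [2, 4, 0, 1]  # hypothetical labels in predicted order
print(ndcg_at(ranked, 1), ndcg_at(ranked, 5), err(ranked))
```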

For comparison, we implement several well-known methods, including RankNet [3], Ranking SVM
and ListMLE. RankNet and Ranking SVM are pairwise methods, and they differ in the choice of
loss function, i.e. logistic loss for RankNet and hinge loss for Ranking SVM. Similarly, choosing
quadratic loss gives us a rank regression method, which we will call Rank Regress. From the rank
modelling point of view, RankNet is essentially the Bradley-Terry model [1] applied to learning to
rank. Likewise, ListMLE is essentially the Plackett-Luce model. We also implement two variants of
the Bradley-Terry model with ties handling, one by Rao and Kupper [24] (denoted by PairTies-RK;
this also appears to be implemented in [29] under the functional gradient setting) and another by
Davidson (denoted by PairTies-D; this is the first time the Davidson method is applied to learning
to rank). See Appendix C for implementation details.

There are three methods resulting from our framework (see the description in Section 3.4). The
first is the PMOP with full decomposition (denoted by PMOP-FD), the second is with Gibbs
sampling (denoted by PMOP-Gibbs), and the third is with Metropolis-Hastings sampling (denoted by
PMOP-MH).



Strictly speaking, RankNet makes use of neural networks as the scoring function, but the overall
loss is still logistic, and for simplicity, we use a simple perceptron.
                First-order features        Second-order features
                ERR     NG@1    NG@5        ERR     NG@1    NG@5
Rank Regress    0.4882  0.683   0.6672      0.4971  0.7021  0.6752
RankNet         0.4919  0.6903  0.6698      0.5049  0.7183  0.6836
Ranking SVM     0.4868  0.6797  0.6662      0.4970  0.7009  0.6733
ListMLE         0.4955  0.6993  0.6705      0.5030  0.7172  0.6810
PairTies-D      0.4941  0.6944  0.6725      0.5013  0.7131  0.6786
PairTies-RK     0.4946  0.6970  0.6716      0.5030  0.7136  0.6793
PMOP-FD         0.5038  0.7137  0.6762      0.5086  0.7272  0.6858
PMOP-Gibbs      0.5037  0.7105  0.6792      0.5040  0.7124  0.6706
PMOP-MH         0.5045  0.7139  0.6790      0.5053  0.7122  0.6713

Table 1: Performance measured in ERR and NDCG@T (abbreviated as NG@T in the column headers).
PairTies-D and PairTies-RK are the Davidson method and the Rao-Kupper method for ties handling,
respectively. PMOP-FD is the PMOP with full decomposition, and PMOP-Gibbs/MH is the PMOP with
Gibbs/Metropolis-Hastings sampling (see Section 3.3 for a description).

For those pairwise methods without ties handling, we simply ignore the tied document pairs.
For ListMLE, we simply sort the documents within a query by relevance scores, and those with
ties are ordered according to the sorting algorithm. All methods, except for PMOP-Gibbs/MH, are
trained using the limited-memory quasi-Newton method known as L-BFGS. L-BFGS is stopped
if the relative improvement over the loss is less than 10^^ or after 100 iterations. As the PMOP- 
Gibbs/MH are stochastic, we run the MCMC for a few steps per query, then update the parameter 
using stochastic gradient ascent. The learning rate is fixed to 0.1, and learning is stopped
after 1,000 iterations.

As for feature representation, we first normalise the features across the whole training set
to roughly have mean 0 and standard deviation 1. We then employ both the first-order features
and second-order features (by taking the Cartesian product of first-order features). The rationale
for the second-order features is that since the first-order features are selected manually based
on Yahoo!'s experience, features are highly correlated. Thus second-order features may capture
aspects not previously thought of by feature designers. Since the number of second-order features
is large, we perform a correlation-based selection. First, we compute the Pearson correlation
between each second-order feature and the label, then choose those features whose absolute
correlation exceeds a threshold. For this particular data, we found that a threshold of 0.15 is
useful, although we did not perform an extensive search. The number of selected second-order
features is 14,188.
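The feature pipeline described above can be sketched with NumPy as follows. This is only an illustration under stated assumptions: a second-order feature is taken to be the product of two standardised first-order features (unordered pairs, squares included), and the function name `second_order_features` is hypothetical. The actual construction used in the experiments may differ.

```python
import numpy as np


def second_order_features(X, y, threshold=0.15):
    """Return the selected pairwise-product features and the index pairs behind them."""
    mean, std = X.mean(axis=0), X.std(axis=0)
    Z = (X - mean) / np.where(std > 0, std, 1.0)   # roughly mean 0, std 1
    y_std = (y - y.mean()) / y.std()               # assumes labels are not all identical

    pairs, columns = [], []
    n_features = Z.shape[1]
    for a in range(n_features):
        for b in range(a, n_features):
            f = Z[:, a] * Z[:, b]                  # candidate second-order feature
            s = f.std()
            if s == 0:
                continue                           # constant feature, no correlation
            corr = np.mean((f - f.mean()) / s * y_std)   # Pearson correlation with label
            if abs(corr) > threshold:
                pairs.append((a, b))
                columns.append(f)
    Z2 = np.column_stack(columns) if columns else np.empty((X.shape[0], 0))
    return Z2, pairs
```

In practice the normalisation statistics and the selected index pairs would be computed on the training set and then reused unchanged on the test set.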

5.2 Results 

The results are reported in Table 1. The following conclusions can be drawn. First, the use
of second-order features improves the performance for nearly all the baseline methods. In our
algorithms, the second-order features yield better performance for PMOP-FD (incorporating the
full decomposition).

Second, with first-order features, all our algorithms outperform the baseline methods on every
metric, and with second-order features PMOP-FD remains the best-performing method. For example,
PMOP-MH wins over the best-performing baseline, ListMLE, by 1.82% in ERR using first-order
features. In our view, this is a significant improvement given the scope of the dataset. We note
that the difference among the top 20 entries on the leaderboard of the Yahoo! challenge is just
1.56%.

As for training time, PMOP-FD is numerically the fastest method. Theoretically, it has
linear complexity, similar to ListMLE. All the pairwise methods are quadratic in the query size,
and thus numerically slower. PMOP-Gibbs/MH is also linear in the query size, up to a constant
factor that is determined by the number of iterations. See Table 2 for a summary.

Pairwise models           PMOP/ListMLE
max{O(N^2), O(NF)}        O(NF)

Table 2: Learning complexity of models, where N is the query size and F is the number of unique
features. For pairwise models, see Appendix B for the details.

6 Conclusions 

Addressing the general problem of ranking with ties, we have proposed a generative probabilistic 
model, with suitable parameterisation to address the problem complexity. We present efficient 
algorithms for learning and inference. We evaluate the proposed models on the problem of learning 
to rank with the data from the recently held Yahoo! challenge, demonstrating that the models
are competitive against well-known rivals designed specifically for the problem, both in predictive 
performance and training time. 
A Computing C

Let us calculate the constant $C$ in Eq. (7). Let us rewrite the equation for ease of comprehension

$$\sum_{S \in 2^{R_k}} \frac{1}{|S|} \sum_{x \in S} \phi_k(x) = C \times \sum_{x \in R_k} \phi_k(x)$$

where $2^{R_k}$ is the power set with respect to the set $R_k$, or the set of all non-empty subsets
of $R_k$. Equivalently

$$C = \frac{\sum_{S \in 2^{R_k}} \frac{1}{|S|} \sum_{x \in S} \phi_k(x)}{\sum_{x \in R_k} \phi_k(x)}$$

If all objects are the same, then this can be simplified to

$$C = \frac{1}{N_k} \sum_{S \in 2^{R_k}} 1 = \frac{2^{N_k} - 1}{N_k}$$

where $N_k = |R_k|$. In the last equation, we have made use of the fact that $\sum_{S \in 2^{R_k}} 1$
is the number of all possible non-empty subsets, or equivalently, the size of the power set, which
is known to be $2^{N_k} - 1$. One way to derive this result is to imagine a collection of $N_k$
variables, each of which has two states: 'selected' and 'not selected', where 'selected' means the
object belongs to a subset. Since there are $2^{N_k}$ such configurations over all states, the number
of non-empty subsets must be $2^{N_k} - 1$.
For arbitrary objects, let us examine the probability that the object $x$ belongs to a subset of
size $m$, which is $m/N_k$. Recall from standard combinatorics that the number of $m$-element subsets
is the binomial coefficient $\binom{N_k}{m}$, where $1 \le m \le N_k$. Thus the number of times an
object appears in any $m$-subset is $\binom{N_k}{m}\frac{m}{N_k}$. Taking into account that this
number is weighted down by $m$ (i.e. $|S|$ in Eq. (7)), the contribution towards $C$ is then
$\binom{N_k}{m}\frac{1}{N_k}$. Finally, we can compute the constant $C$, which is the weighted number
of times an object belongs to any subset of any size, as follows

$$C = \sum_{m=1}^{N_k} \binom{N_k}{m} \frac{1}{N_k} = \frac{1}{N_k} \sum_{m=1}^{N_k} \binom{N_k}{m} = \frac{2^{N_k} - 1}{N_k}$$

We have made use of the known identity $\sum_{m=1}^{N_k} \binom{N_k}{m} = 2^{N_k} - 1$.
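The closed form for $C$ can be checked by brute force on a small set. The sketch below assumes, in line with the "weighted down by $|S|$" remark above, that each subset contributes the mean of its members' worth values; the function names and the example values are hypothetical. Exact rational arithmetic is used so that the comparison is free of rounding error.

```python
from fractions import Fraction
from itertools import combinations


def weighted_subset_sum(phi):
    """Sum, over all non-empty subsets S, of the mean of phi over S."""
    total = Fraction(0)
    for m in range(1, len(phi) + 1):
        for subset in combinations(phi, m):
            total += sum(subset, Fraction(0)) / m
    return total


def constant_c(n):
    """Closed form C = (2^n - 1) / n derived in this appendix."""
    return Fraction(2 ** n - 1, n)


phi = [Fraction(1), Fraction(5, 2), Fraction(7)]  # hypothetical worth values
print(weighted_subset_sum(phi), constant_c(len(phi)) * sum(phi))
```

The enumeration is exponential in the number of objects, which is exactly the cost that the closed form avoids.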
B Pairwise Losses



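Let $\delta_{ij}(w) = \phi(x_i, w) - \phi(x_j, w)$; the pairwise losses are

$$\mathrm{loss}(x_i \succ x_j; w) = \begin{cases} \log\bigl(1 + \exp(-\delta_{ij}(w))\bigr) & \text{for logistic loss in RankNet} \\ \max\{0,\, 1 - \delta_{ij}(w)\} & \text{for hinge loss in Ranking SVM} \\ (1 - \delta_{ij}(w))^2 & \text{for quadratic loss in Rank Regress} \end{cases}$$

The overall loss is then

$$\mathrm{Loss} = \sum_{i} \sum_{j<i} \mathrm{loss}(x_i \succ x_j; w)$$

Taking the derivative with respect to $w$ yields

$$\frac{\partial \mathrm{Loss}}{\partial w} = \sum_{i} \sum_{j<i} \frac{\partial\, \mathrm{loss}(x_i \succ x_j; w)}{\partial \delta_{ij}} \left( \frac{\partial \phi(x_i, w)}{\partial w} - \frac{\partial \phi(x_j, w)}{\partial w} \right)$$

$$= \sum_{i} \frac{\partial \phi(x_i, w)}{\partial w} \left( \sum_{j<i} \frac{\partial\, \mathrm{loss}(x_i \succ x_j; w)}{\partial \delta_{ij}} \right) - \sum_{j} \frac{\partial \phi(x_j, w)}{\partial w} \left( \sum_{i>j} \frac{\partial\, \mathrm{loss}(x_i \succ x_j; w)}{\partial \delta_{ij}} \right)$$

As it takes $N^2$ time to compute all the partial derivatives
$\partial\, \mathrm{loss}(x_i \succ x_j; w) / \partial \delta_{ij}$ for all $i, j$ where $j < i$,
the overall gradient requires $N^2 + NF$ time. Thus the complexity of the pairwise methods is
$O(\max\{N^2, NF\})$.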
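The regrouping of the gradient is what keeps the cost at $N^2 + NF$ rather than $N^2 F$: the per-pair derivatives are first accumulated into one coefficient per object, and the feature vectors are touched only once at the end. The sketch below shows this for the logistic loss, assuming a linear scoring function $\phi(x, w) = w^\top x$; the linear form and the function name are assumptions for illustration.

```python
import numpy as np


def pairwise_logistic(X, w, pairs):
    """Logistic pairwise loss and gradient.

    X: (N, F) feature matrix; w: (F,) weights;
    pairs: list of (i, j) meaning x_i is preferred to x_j.
    """
    scores = X @ w                          # O(NF)
    coef = np.zeros(X.shape[0])             # accumulated d loss / d score per object
    loss = 0.0
    for i, j in pairs:                      # O(N^2) pairs at most
        delta = scores[i] - scores[j]
        loss += np.logaddexp(0.0, -delta)   # log(1 + exp(-delta)), computed stably
        g = -0.5 * (1.0 - np.tanh(0.5 * delta))   # equals -1 / (1 + exp(delta))
        coef[i] += g
        coef[j] -= g
    grad = X.T @ coef                       # O(NF)
    return loss, grad
```

Swapping in the hinge or quadratic loss only changes the two lines that compute `loss` and `g`.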

C Learning the Paired Ties Models 

This section describes the details of learning the paired ties models discussed in Section 4.

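Rao-Kupper method. Recall that the Rao-Kupper model defines the following probability
masses

$$P(x_i \succ x_j \mid w) = \frac{\phi(x_i, w)}{\phi(x_i, w) + \theta\phi(x_j, w)}$$

$$P(x_i \prec x_j \mid w) = \frac{\phi(x_j, w)}{\theta\phi(x_i, w) + \phi(x_j, w)}$$

$$P(x_i \approx x_j \mid w) = \frac{(\theta^2 - 1)\,\phi(x_i, w)\phi(x_j, w)}{[\phi(x_i, w) + \theta\phi(x_j, w)]\,[\theta\phi(x_i, w) + \phi(x_j, w)]}$$

where $\theta \ge 1$ is the ties factor and $w$ is the model parameter. For ease of unconstrained
optimisation, let $\theta = 1 + e^{\alpha}$ for $\alpha \in \mathbb{R}$. In learning, we want to
estimate both $\alpha$ and $w$. Let

$$P_i = \frac{\phi(x_i, w)}{\phi(x_i, w) + (1 + e^{\alpha})\phi(x_j, w)}, \qquad P_j^{*} = \frac{\phi(x_j, w)}{\phi(x_i, w) + (1 + e^{\alpha})\phi(x_j, w)}$$

$$P_i^{*} = \frac{\phi(x_i, w)}{(1 + e^{\alpha})\phi(x_i, w) + \phi(x_j, w)}, \qquad P_j = \frac{\phi(x_j, w)}{(1 + e^{\alpha})\phi(x_i, w) + \phi(x_j, w)}$$

Taking partial derivatives of the log-likelihood gives

$$\frac{\partial \log P(x_i \succ x_j \mid w)}{\partial w} = (1 - P_i) \frac{\partial \log \phi(x_i, w)}{\partial w} - (1 + e^{\alpha}) P_j^{*} \frac{\partial \log \phi(x_j, w)}{\partial w}$$

$$\frac{\partial \log P(x_i \succ x_j \mid w)}{\partial \alpha} = -P_j^{*} e^{\alpha}$$

$$\frac{\partial \log P(x_i \approx x_j \mid w)}{\partial w} = \bigl(1 - P_i - (1 + e^{\alpha}) P_i^{*}\bigr) \frac{\partial \log \phi(x_i, w)}{\partial w} + \bigl(1 - P_j - (1 + e^{\alpha}) P_j^{*}\bigr) \frac{\partial \log \phi(x_j, w)}{\partial w}$$

$$\frac{\partial \log P(x_i \approx x_j \mid w)}{\partial \alpha} = \left( \frac{2(1 + e^{\alpha})}{(1 + e^{\alpha})^2 - 1} - P_i^{*} - P_j^{*} \right) e^{\alpha}$$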
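The Rao-Kupper derivatives can be verified numerically. The sketch below works directly with $a = \log\phi(x_i, w)$ and $b = \log\phi(x_j, w)$, so the returned triples are the coefficients of $\partial \log\phi(x_i, w)/\partial w$ and $\partial \log\phi(x_j, w)/\partial w$ together with the derivative with respect to $\alpha$. All function names are hypothetical, and the finite-difference helper is there only for checking.

```python
import math


def rk_terms(a, b, alpha):
    """Return theta and the ratios P_i, P_j*, P_i*, P_j for a = log phi_i, b = log phi_j."""
    theta = 1.0 + math.exp(alpha)
    phi_i, phi_j = math.exp(a), math.exp(b)
    p_i = phi_i / (phi_i + theta * phi_j)
    pj_star = phi_j / (phi_i + theta * phi_j)
    pi_star = phi_i / (theta * phi_i + phi_j)
    p_j = phi_j / (theta * phi_i + phi_j)
    return theta, p_i, pj_star, pi_star, p_j


def log_p_win(a, b, alpha):
    theta = 1.0 + math.exp(alpha)
    return a - math.log(math.exp(a) + theta * math.exp(b))


def log_p_tie(a, b, alpha):
    theta = 1.0 + math.exp(alpha)
    phi_i, phi_j = math.exp(a), math.exp(b)
    return (math.log(theta ** 2 - 1) + a + b
            - math.log(phi_i + theta * phi_j) - math.log(theta * phi_i + phi_j))


def grad_log_p_win(a, b, alpha):
    theta, p_i, pj_star, _, _ = rk_terms(a, b, alpha)
    return (1 - p_i, -theta * pj_star, -pj_star * math.exp(alpha))


def grad_log_p_tie(a, b, alpha):
    theta, p_i, pj_star, pi_star, p_j = rk_terms(a, b, alpha)
    d_a = 1 - p_i - theta * pi_star
    d_b = 1 - p_j - theta * pj_star
    d_alpha = (2 * theta / (theta ** 2 - 1) - pi_star - pj_star) * math.exp(alpha)
    return (d_a, d_b, d_alpha)


def finite_diff(f, args, eps=1e-6):
    """Central finite differences of f with respect to each argument."""
    grads = []
    for k in range(len(args)):
        hi, lo = list(args), list(args)
        hi[k] += eps
        lo[k] -= eps
        grads.append((f(*hi) - f(*lo)) / (2 * eps))
    return tuple(grads)


point = (0.3, -0.5, 0.2)  # hypothetical (log phi_i, log phi_j, alpha)
print(grad_log_p_tie(*point), finite_diff(log_p_tie, point))
```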

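Davidson method. Recall that in the Davidson method the probability masses are defined as

$$P(x_i \succ x_j \mid w) = \frac{\phi(x_i, w)}{\phi(x_i, w) + \phi(x_j, w) + \nu\sqrt{\phi(x_i, w)\phi(x_j, w)}}$$

$$P(x_i \approx x_j \mid w) = \frac{\nu\sqrt{\phi(x_i, w)\phi(x_j, w)}}{\phi(x_i, w) + \phi(x_j, w) + \nu\sqrt{\phi(x_i, w)\phi(x_j, w)}}$$

where $\nu > 0$. Again, for simplicity of unconstrained optimisation, let $\nu = e^{\beta}$ for
$\beta \in \mathbb{R}$. Let

$$P_i = \frac{\phi(x_i, w)}{\phi(x_i, w) + \phi(x_j, w) + e^{\beta}\sqrt{\phi(x_i, w)\phi(x_j, w)}}$$

$$P_j = \frac{\phi(x_j, w)}{\phi(x_i, w) + \phi(x_j, w) + e^{\beta}\sqrt{\phi(x_i, w)\phi(x_j, w)}}$$

$$P_{ij} = \frac{e^{\beta}\sqrt{\phi(x_i, w)\phi(x_j, w)}}{\phi(x_i, w) + \phi(x_j, w) + e^{\beta}\sqrt{\phi(x_i, w)\phi(x_j, w)}}$$

Taking derivatives of the log-likelihood gives

$$\frac{\partial \log P(x_i \succ x_j \mid w)}{\partial w} = (1 - P_i - 0.5 P_{ij}) \frac{\partial \log \phi(x_i, w)}{\partial w} - (P_j + 0.5 P_{ij}) \frac{\partial \log \phi(x_j, w)}{\partial w}$$

$$\frac{\partial \log P(x_i \succ x_j \mid w)}{\partial \beta} = -P_{ij}$$

$$\frac{\partial \log P(x_i \approx x_j \mid w)}{\partial w} = (0.5 - P_i - 0.5 P_{ij}) \frac{\partial \log \phi(x_i, w)}{\partial w} + (0.5 - P_j - 0.5 P_{ij}) \frac{\partial \log \phi(x_j, w)}{\partial w}$$

$$\frac{\partial \log P(x_i \approx x_j \mid w)}{\partial \beta} = 1 - P_{ij}$$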
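A numerical check of the Davidson derivatives can be written in the same spirit. As a sketch, the code below parameterises the pair by $a = \log\phi(x_i, w)$, $b = \log\phi(x_j, w)$ and $\beta$, and compares the analytic coefficients with central finite differences. The function names are hypothetical.

```python
import math


def davidson_terms(a, b, beta):
    """Return P_i, P_j, P_ij for a = log phi_i, b = log phi_j, nu = exp(beta)."""
    phi_i, phi_j = math.exp(a), math.exp(b)
    tie_mass = math.exp(beta) * math.sqrt(phi_i * phi_j)
    z = phi_i + phi_j + tie_mass
    return phi_i / z, phi_j / z, tie_mass / z


def log_p_win(a, b, beta):
    z = math.exp(a) + math.exp(b) + math.exp(beta + 0.5 * (a + b))
    return a - math.log(z)


def log_p_tie(a, b, beta):
    z = math.exp(a) + math.exp(b) + math.exp(beta + 0.5 * (a + b))
    return beta + 0.5 * (a + b) - math.log(z)


def grad_log_p_win(a, b, beta):
    p_i, p_j, p_ij = davidson_terms(a, b, beta)
    return (1 - p_i - 0.5 * p_ij, -(p_j + 0.5 * p_ij), -p_ij)


def grad_log_p_tie(a, b, beta):
    p_i, p_j, p_ij = davidson_terms(a, b, beta)
    return (0.5 - p_i - 0.5 * p_ij, 0.5 - p_j - 0.5 * p_ij, 1 - p_ij)


def finite_diff(f, args, eps=1e-6):
    """Central finite differences of f with respect to each argument."""
    grads = []
    for k in range(len(args)):
        hi, lo = list(args), list(args)
        hi[k] += eps
        lo[k] -= eps
        grads.append((f(*hi) - f(*lo)) / (2 * eps))
    return tuple(grads)


point = (0.4, -0.1, -0.3)  # hypothetical (log phi_i, log phi_j, beta)
print(grad_log_p_win(*point), finite_diff(log_p_win, point))
```

With a differentiable scoring function, the full gradient with respect to $w$ follows by the chain rule from the first two entries of each triple.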